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ABSTRACT 

With  advances  in  computer  architecture  and  software,  Newton  methods 
are  becoming  not  only  feasible  for  large-scale  nonlinear  optimization 
problems,  but  also  reliable,  fast  and  efficient.  Truncated  Newton  methods,  in 
particular,  are  emerging  as  a  versatile  subclass.  In  this  paper  we  present  a 
truncated  Newton  algorithm  specifically  developed  for  potential  energy 
minimization.  The  method  is  globally  convergent  with  local  quadratic 
convergence.  Its  key  ingredients  are:  1)  approximation  of  the  Newton 
direction  far  way  from  local  minima,  2)  solution  of  the  Newton  equation 
iteratively  by  the  linear  Conjugate  Gradient  method,  and  3)  preconditioning 
of  the  Newton  equation  by  the  analytic  second-derivative  components  of  the 
"local''  chemical  interactions:  bond  length,  bond  angle  and  torsional 
potentials.  Relaxation  of  the  required  accuracy  of  the  Newton  search  direction 
diverts  the  minimization  search  away  from  regions  where  the  function  is  non- 
convex  and  towards  physically  interesting  regions.  The  preconditioning 
strategy  significantly  accelerates  the  iterative  solution  for  the  Newton  search 
direction,  and  therefore  reduces  the  computation  time  for  each  iteration. 
With  algorithmic  variations,  the  truncated  Newton  method  can  be  formulated 
so  that  storage  and  computational  requirements  are  comparable  to  that  of  the 
nonlinear  Conjugate  Gradient  method.  As  the  convergence  rate  of  nonlinear 
Conjugate  Gradient  methods  is  linear  and  performance  less  predictable,  the 
application  of  the  truncated  Newton  code  to  potential  energy  functions  is 
promising. 
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1.    INTRODUCTION 

Minimization  of  the  energy  function  is  an  important  and  challenging  component  in  the 
prediction  of  three-dimensional  structures  of  large  biological  molecules  by  potential  energy 
methods.  Potential  energy  functions  are  generally  large  (hundreds  to  thousands  of  variables). 
nonlinear,  and  continuously  differentiable.  The  minimization  problem  can  be  formulated  as 
unconstrained.  Furthermore,  the  analytic  derivatives  of  the  potential  energy  can  generally  be 
written  compactly  as  pairwise  and  individual  sums  of  relatively  simple  functions  in  the  cartesian 
variables  of  the  atoms.  Thus,  minimization  algorithms  that  rely  on  the  availability  of  these 
derivatives  and  exploit  the  problem  structure  can  be  particularly  effective. 

Newton  methods  have  always  been  attractive,  since  nonlinear  programming  algorithms  are 
rapidly  convergent  to  a  local  minimum  when  the  search  direction  is  a  sufficiently  good 
approximation  to  the  Newton  direction.  However,  the  storage  and  computational  requirements  of 
Newton  methods  have  restricted  their  application  in  practice.  For  an  n  —  dimensional  problem, 
storage  of  the  analytic  Hessian  demands  O(n')  (i.e.  proportional  to  n  )  memory  locations,  and 
direct  solution  of  the  Newton  equation  0(n  )  additions  and  multiplications.  Thus,  in  general, 
Newton  methods  have  been  applied  only:  1)  to  small  problems,  2)  to  problems  with  special 
sparsity  pattern,  and  3)  near  a  solution  after  a  gradient  method  has  been  applied  [1-3]. 
Fortunately,  advances  in  computing  machinery  and  software  are  making  the  Newton  approach  of 
approximating  the  objective  function  by  a  local  quadratic  model  feasible  for  a  wide-range  of 
problems.  Indeed.  Newton  methods  are  increasing  in  popularity,  versatility  and  effectiveness  for 
large-scale  nonlinear  optimization  problems  and  will  undoubtedly  continue  to  do  so  [4-12]. 

In  this  paper,  we  describe  a  truncated  Newton  method  based  on  the  preconditioned  linear 
Conjugate  Gradient  (CG)  method.  The  linear  CG  is  the  method  used  for  approximating  the 
Newton  search  direction.  The  method  we  describe,  based  on  the  algorithm  of  Dembo  and 
Steihaug  [6],  has  been  specifically  adapted  to  the  problem  of  potential  energy  minimization.  Our 
version    incorporates   preconditioning   by   exploiting   the   particular   separability   structure   of  the 


Hessian  matrix  to  construct  a  powerful  preconditioner  for  the  Conjugate  Gradient  algorithm.  The 
key  feature  of  truncated  Newton  methods  is  concentration  of  computational  effort  only  in 
physically  important  regions  (i.e.  near  the  minimum).  When  our  preconditioning  strategy  is 
combined  with  this  feature,  the  algorithm  becomes  globally  [Nl]  and  rapidly  convergent. 

In  general,  the  performance  of  truncated  Newton  methods  for  large-scale  problems  is 
superior  to  gradient  methods  as  well  as  to  other  Newton  variants.  Although  the  nonlinear  CG 
method  [N2. 13-17]  is  attractive  because  of  its  modest  computational  and  storage  requirements,  its 
performance  in  general  is  unpredictable  unless  we  know  a  priori  that  the  eigenvalue  structure  of 
the  Hessian  at  the  solution  is  favorable.  As  Newton  methods  are  generally  more  reliable  and 
efficient  than  CG  methods,  the  benefits  derived  from  derivative  computation  and  manipulation  far 
exceed  the  cost  of  the  additional  program  complexity. 

The  application  of  the  truncated  Newton  method  to  potential  energy  functions  is  promising, 
since  the  storage  requirements  of  the  truncated  Newton  and  nonlinear  CG  methods  can  be  made 
comparable.  For  example,  in  the  code  described  here,  only  the  Hessian  elements  from  the  bond 
length,  bond  angle  and  torsional  potentials  need  to  be  computed  analytically  and  stored.  (There 
are  only  O(n)  such  terms,  not  0{n')).  This  approach  can  thereby  avoid  the  costh  analytic 
manipulation  of  the  pairwise  Lennard-Jones  and  Coulombic  potentials.  Most  important,  the 
exceptionally  rapid  convergence  rate  of  classic  Newton  methods  is  retained. 

We  apply  in  this  paper  several  variants  of  the  truncated  Newton  algorithm  to  the  potential 
energy  function  of  the  nucleic  acid  component  deoxyribose.  We  compare  the  performance  of  these 
methods  to  full  Newton  methods,  and  to  two  state-of-the-art  versions  of  the  nonlinear  CG  and  the 
quasi-Newton  method  BFGS.  Details  of  the  potential  energy  fuction  are  provided  in  the 
accompanying  paper  [30]. 

In  the  next  section,  we  discuss  the  basic  structure  of  Newton  algorithms  and  present  a 
perspective  on  the  application  of  various  classes  of  Newton  methods  to  large-scale  problems. 
Section  3  describes  the  truncated  Newton  version  that  we  have  implemented,  and  Section  4 
provides  the  computational  results  and  interpretations.  Conclusions  as  well  as  future  directions  for 
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nucleic  acid  applications  are  discussed  in  Section  5. 
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2.     NEWTON-TYPE  METHODS 


Throughout  this  paper,  we  generally  use  the  following  notational  conventions.  Scalars  are 
denoted  by  lower-case  Greek  letters  (a  ,  (3  ,  \),  column  vectors  by  bold  faced  lower-case  Roman 
letters  (x  .  p  ).  and  matrices  by  capital  Roman  letters.  Unless  stated  otherwise,  all  vectors 
belong  to  R"  ,  the  n  —  dimensional  vector  space.  A  superscript  T  denotes  a  vector  transpose  so 
that  \T  represents  a  row  vector.     A     a  matrix  transpose  and  x  y  an  inner  product     We  associate 


i  : 


the  standard  Euclidian    norm  with  each  vector  in  R* 


.  defined  as  ||x|j  =  •!  ^  |.v,| 
1  =  1 


We  are  interested  in  solving  the  optimization  problem  without  constraints 


minimize      fix)  .  x  €  R" 


(1) 


locally,   where  £   is  a   real-valued   function    of  n   (or   3,V)    cartesian   variables   of  the   .V   atoms: 
E(\)  =  E(xi  ,  xj  ,  ...  ,  xn).    We  will  assume  that  E  is  twice  continuously  differentiate  and  denote 


the  gradient  off  at  x  by    the  vector  g(x)  with  n  components      £,(x) 


dEjx) 

dx, 


.  and  the  Hessian 


matrix   of  second  derivatives  at  x  as   Hix)    with  n     components  Hu(\) 


d'Elx) 
d.v,  dXj 


We   are 


seeking    a    strong  local  minimum     x*,    a    point    for    which    there    exists    an    t]  >  0    such    that 
£(x  )  <  £(x)       and      H(x')    is   positive-definite    [N3]   for  all  x^x*    within   a  distance   t|   of  x": 

l|x-x*||<-n, 

Newton  methods  are  based  on  approximating  the  objective  function  locally  by  a  quadratic 
model  and  minimizing  or  approximately  minimizing  that  model.  If  we  denote  by  x*  the  current 
approximation  to  the  solution  vector  x*  .  and  by  g^  and  Hk  the  gradient  and  Hessian  evaluated  at 
\k,  respectively,  then  the  new  estimate  for  x*  is  obtained  from  the  Taylor  series  expansion  along  a 
search  direction  p: 
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E(\k  +  p)  ~  E(xk)  +g/p+TPrfftP  (2) 

The  minimum  of  the  quadratic  function  in  the  right  hand  side  will  be  achieved  if  pk  satisfies  the 
linear  system  known  as  the  Newton  equation: 

HkPk  =  "git-  (3) 

In  the  classic  Newton  method,  a  sequence  of  iterates  {  x0  ,  \[  .*:  »•■■  1    >s    constructed  from  a 
given  point  Xq  by  the  formula 


l*-i 


x*  +  P*  (4) 


-. ' :  gfli  •  0. 
where  pk  is  the  exact  solution  of  (3).  When  E  is  sufficiently  smooth  this  produces  an  algorithm 

which  is  locally  and  quadratically  convergent  [11,  for  example].    That  is.  given  a  sufficiently  close 

estimate   Xq  to   x  .  there  exists  a  constant  p  (the  convergence  ratio)  such  that 

||x,  +  1-x*||<p|k-x*||2.  •        (5) 

In  practical  terms,  this  means  that  the  number  of  digits  of  accuracy  is  doubled  at  every  step! 

In  practice,  however,  modifications  are  necessary  to  produce  an  effective  and  globally 
convergent  method  with  this  exceptional  convergence  rate.  First,  when  Hk  is  not  positive-definite, 
the  search  direction  may  not  exist  or  may  not  be  a  descent  direction  ( p  is  a  descent  direction  it 
SkPk  <  0.  see  [N4]);  strategies  to  produce  a  related  positive-definite  matrix  Hk  or  alternative 
search  directions  become  necessary.  Second,  far  away  from  a  solution  x  the  quadratic 
approximation  is  inaccurate,  and  consequently,  the  Newton  direction  must  be  scaled  (dampened); 
a  line  search  algorithm  is  introduced  to  minimize  approximately  the  objective  function  along  that 
search  direction.  The  line  search,  which  is  an  approximate  one-dimensional  minimization  (usually 
carried  out  using  polynomial  interpolation),  ensures  sufficient  function  reduction  [N5]  and  thereby 
guarantees  [13, for  example]  that  uniform  progress  is  made  toward  the  solution  at  every  step. 
These    modifications    lead    to    a   modified   Newton    method,   embodied    in    the    following   general 
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framework: 


* 

Supply 

an  initial  guess  x0 

* 

For  k 

i 

=  012 

Test  tor  con\  ergence:  if  ||  g;     <  e.  stop 

•) 

Compute  a  positive-definite 
approximation  Hk  to  W; 

3. 

Solve  for  the  search  direction  p.  so  that 

II  #7 Pit  +  g*  II  ^  <t>t  ||  gk  || 

where  ^^  is  some    prescribed  quantity  that 
"controls"  the  accuracy  of  the  computed  p; 

4. 

Compute  an  appropriate  step  length  kK  to 
ensure  "sufficient  decrease"  (defined  later). 

5. 

Set   x^]  =  xk  +  \k  pk 

(Al) 


Many  Newton  variants  are  constructed  by  combining  various  srrategies  for  the  individual 
components  of  the  general  algorithm  above.  Specifically,  these  involve  choices  in  the  following 
computational  categories: 

1  )      Formulation  of  the  Hessian    - 

Analytic,   by   finite-differences,  or   by    some    low    rank   update   a\   the   previously    constructed 
Hessian 

2)      Strategies  for  an  Indefinite  Hessian    - 

Based    on    the    modified    Cholesky    factorization    |11.    p  10S    or     18],    Symmetric    Indefinite 
factorization  [19],  or  possibly  others  [11.  p .  1 13]. 


3)      Solution  process  for  the  Newton  Equation    - 
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Iterative  or  direct.  If  iterative,  then  by  the  linear  Conjugate  Gradient  method  [20].  the 
Conjugate  Residual  method  [21],  or  others.  How  accurate  should  the  computed  solution  be  ? 
And  uhen  Hk  is  indefinite  as  well  as  ||  gk  ||  small,  what  descent  direction  should  be  used  [8 
or  11.  p. 113]  ? 

4)      Strategies  for  the  Line  Search    - 

What  criteria  of  sufficient  decrease  to  use  and  what  algorithm  to  implement  ? 

For  example,  when  Hk  is  approximated  by  finite -differences  [N6],  a  subclass  known  as 
Discrete  Newton  methods  is  defined:  when  Hk  (  or  Hk  )  is  approximated  by  a  low  rank  update  of 
the  previouly  constructed  Hessian.  Quasi  — Newton  methods  emerge:  when  <£>k  is  nonzero. 
Truncated  Newton  methods  are  formed.  The  computational  costs,  storage  requirements  and 
convergence  properties  [N7]  of  these  algorithms  differ  considerably  and  depend  on  the  problem's 
size,  form  and  choice  of  parameters. 

Standard  Discrete  Newton  methods  require  n  gradient  evaluations  and  O(n')  operations  to 
compute  and  symmetrize  every  Hessian  Hk  .  With  exact  arithmetic  (i.e.  in  absence  of  round-off 
error),  they  will  converge  quadratically  if  the  finite-difference  interval  goes  to  zero  asymptotically 
with  ||  g  ||  .  Nevertheless,  typically  round-off  error  limits  the  accuracy  that  can  be  obtained. 
Consequently,  finite-difference  Newton  methods  are  inappropriate  for  large-scale  problems  unless 
the  Hessian  has  a  known  sparsity  structure  [N8]  which  can  be  exploited  to  reduce  considerably  the 
computational  effort  and,  hence,  the  potential  for  round-off  error. 

Quasi-Newton  methods  can  be  particularly  useful  for  large-scale  problems  when  the  low  rank 
Hessian  update  is  exploited  to  solve  more  efficiently  the  linear  system  for  pk  .  This  can  be  done, 
for  example,  by  updating  the  Cholesky  factors  of  the  matrix  (see  Section  4)  rather  than  the  matrix 
elements  themselves  [22].  Once  the  factors  are  available,  the  linear  system  can  be  solved  by 
0(n~)  operations  rather  than  0(n  ).  Alternatively,  limited  memory  quasi-Newton  methods  [11, p. 
150]  can  reduce  the  usual  storage  requirements  of  Oin')  by  storing  only  the  vectors  that  define 
the  Hessian  update  and  by  using  these  vectors  to  define  the  new  pk  .  Although  quasi-Newton 
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methods  avoid  finite-difference  approximations  as  well  as  formulation  of  the  analytic  second- 
derivatives,  they  may  suffer  seriously  from  rounding  errors:  the  updated  Hessian  approximation 
may  become  singular  or  indefinite  even  when  the  analytic  formula  guarantees  hereditary  positive- 
definiteness.  Moreover,  their  behavior  may  be  particularly  inefficient  in  regions  where  the 
curvature  of  the  analytic  Hessian  changes  most  rapidly.  L'nder  the  assumption  that  the  initial 
Hessian  approximation  is  sufficiently  close  to  the  Hessian  at  the  solution,  their  convergence  rate  is 
superlinear  [N7],  an  intermediate  between  the  linear  convergence  of  gradient  methods  and  the 
quadratic  convergence  of  other  Newton  variants.  Thus,  in  general,  one  must  combine  the  various 
components  most  appropriate  to  the  problem  at  hand  to  produce  a  powerful  Newton  method. 

Truncated  Newton  methods  appear  ideal  for  potential  energy  minimization.  First,  the  size  of 
potential  energy  functions  and  the  complexity  of  conformational  space  justify  the  approximation  of 
search  directions  early  in  the  search.  Second,  the  compact  form  of  the  energy  function  permits 
efficient  analytic  differentiation.  Recently,  a  general  interest  in  and  development  of  this  class  of 
algorithms  has  increased.  Indeed,  the  approach  of  approximating  the  search  direction  far  away 
from  a  local  minimum  can  be  as  effective  as  methods  based  on  accurate  solutions  for  all  iterations. 
By  requiring  far  less  work  in  function  and  derivative  evaluation  and  manipulation,  truncated 
Newton  methods  can  be  much  faster.  In  fact,  under  the  natural  assumption  that  the 
relative  residual  \\  Hk  p*.  +  g*  ||  /  ||  g*  ||  or  <Pk  is  less  than  1  for  all  k  and  converges  to 
zero    as  ||  g^  ||  does,  quadratic  convergence  can  be  attained  [4,6]. 

In  summary,  the  main  strengths  of  truncated  Newton  methods  for  large-scale  applications  is 
their  "adaptive"  philosophy  and  algorithmic  versatility:  they  allow  the  user,  by  varying  a  single 
parameter,  to  determine  the  trade-off  between  the  cost  per  iteration  and  the  overall  cost  to  obtain 
a  solution.  With  variations  in  storage  and  computational  schemes,  they  can  be  applied  to  a  wide 
range  of  large-scale  nonlinear  optimization  problems. 
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3.     THE  TRUNCATED  NEWTON  ALGORITHM 


3.1   Theory 

In  a  truncated  Newton  method,  the  solution  to  the  Newton  equation  (3)  or  possibly  the 
modified  Newton  equation  is  approximated  rather  than  solved  to  completion  by  some  iterative 
scheme.  This  produces  a  nested  iteration:  an  inner  loop  to  determine  p^  from  some  sequence 
{  Pit  *  P/t  >  Pit  .  ■  •  •  }  f°r  eacn  xk  m  tne  outer  loop  iteration.  Since  the  Newton  system  was 
originally  obtained  from  the  requirement  that  a  quadratic  rnodel  be  minimized,  any  iterative 
minimization  algorithm  or  iterative  linear  solver  can  be  considered. 

The  linear  Conjugate  Gradient  method  is  particularly  attractive  because  of  its  modest  storage 
requirements  (in  our  application  O(n)),  modest  computational  requirements  (only  several  vector 
operations  and  one  matrix/ vector  product  must  be  computed  at  every  iteration),  and  fast 
convergence  when  preconditioned  (Preconditioning  is  discussed  later).  Such  an  iterative  method  is 
ideal  for  truncation. 

At  this  point,  a  remark  on  terminology  is  essential.  The  name  truncated  Newton  is  slightly 
misleading;  it  is  not  the  Newton  iteration  (outer  loop)  which  is  truncated  but  rather  the  solution 
process  to  obtain  the  Newton  search  direction  (inner  loop).  These  algorithms  are  also  called 
inexact  Newton  methods  [4].  Since  this  name  might  imply  imprecision  rather  than  flexibility,  we 
will  adhere  to  the  term  truncated  Newton. 

Why  can  one  afford  to  approximate  the  solution  for  the  Newton  direction  ?  As  a  quadratic 
model  is  necessarily  inaccurate  far  awav  from  a  solution,  the  effort  for  solving  the  svstem  exactlv 
is  unjustified;  any  descent  direction  will  suffice  to  produce  sufficient  decrease.  Moreover,  for  a 
distant  initial  guess,  the  approximate  strategy  might  actually  lead  us  more  quickly  to  the  relevant 
regions  of  the  multi-dimensional  function  surface. 


3.2    Practical  Implementation 

For  a  truncated  Newton  method  to  attain  quadratic  .convergence  near  the  solution,  it  is 
necessary  to  require  that  near  a  local  minimum,  the  Newton  equation  will  be  solved  exactly.  The 
truncation  criterion  must  then  enforce  this.  Since  the  gradient  norm  provides  a  measure  of  the 
relative  progress  that  is  made  toward  the  solution  at  every  step,  it  is  a  natural  choice  as  a  variable 
in  the  forcing  sequence  {  <t>t  }.  One  suggested  function  is  [6]: 

<t>fc  =  min  |  -  ,  ||  gk  \\  \  .  (6) 

With  this  dependency  on  ||  g^    [.  the  truncation  criterion 

:'T['E*p*  +  g*  il^  <*>*  II  g*  II  <7> 

is  easily  satisfied  at  regions  distant  from  critical  points;  as  a  critical  point  is  approached,  the 
condition  becomes  more  stringent,  and  this  leads  to  an  increasingly  accurate  solution  for  p^..  By 
including  a  sequence  that  is  less  than  1  for  all  k  and  converges  to  0  in  the  definition  of  <$>k  (  1  Ik 
above),  the  residual  ||  Hk  pk  +  gk  ||  can  be  forced  to  be  smaller  when  progress  toward  a  solution  is 
slow.  That  is,  when  ||  gk  ||  is  large  and  many  Newton  iterations  have  been  performed,  a  more 
accurate  search  direction  might  be  necessary.  With  properly  chosen  parameters  in  the  CG 
algorithm  (see  algorithm  (A3)),  it  can  be  guaranteed  that  the  truncation  criterion  (7)  is  satisfied 
near  a  strong  local  minimum  x*  [6]. 

When  the  linear  CG  method,  originally  designed  for  positive-definite  systems,  is  applied  to 
the  linear  system  Hk  pk  =  -  gk  ,  adaptations  must  be  made  in  the  event  that  the  Hessian  Hk  is 
indefinite.  Two  approaches  are  possible.  Dembo  and  Steihaug  halt  the  CG  loop  when  a  direction 
of  negative  curvature  is  detected  and  exit  from  it  with  a  guaranteed  direction  of  descent.  Nash  [7] 
and  O'Leary  [8]  adapt  the  indefinite  CG  version  [23]  based  on  the  Lanczos  tridiagonalization  of 
symmetric  matrices  [24].  Our  code  incorporates  the  former.  This  strategy  resulted  in  rapid 
progress  at  indefinite  regions,  and  we  believe  that  the  Lanczos  approach  will  not  produce  any 


significant  advantage. 

In  addition  to  modifying  the  standard  CG  code  to  indefinite  systems,  preconditioning  [25.26] 
is  essential  to  obtain  fast  convergence.  This  is  particularly  important  in  the  truncated  Newton 
context  since  the  overall  success  of  the  method  relies  heavily  on  obtaining  a  good  search  direction 
in  a  relatively  small  number  of  inner  loop  iterations  (generally  <<  n).  Furthermore,  for  large 
problems  such  as  potential  energy  functions,  acceleration  of  CG  is  essential  in  order  to  reduce  the 
accumulation  of  round-off  errors.  Although  the  method  converges  theoretically  in  at  most  n  steps. 
round-off  errors  destroy  the  mutual  conjugacy  property  of  the  search  directions  (  d,Ady  =  0  for 
all  i=£j  ,  see  [13]  for  example).  As  a  result,  considerably  more  iterations,  typically  2/;  to  5n  ,  are 
performed. 

Preconditioned  Conjugate  Gradient.  The  idea  in  preconditioning  the  linear  system  Hp  =  -g 
(we  suppress  now  the  k  subscripts  for  simplicity)  is  to  construct  a  transformation  on  H  so  that  a 
better  convergence  rate  is  realized.  For  the  linear  system  Ax  =  -  b  where  A  is  positive-definite, 
the  convergence  properties  depend  on  the  entire  matrix  spectrum:  faster  convergence  is  expected 
when  the  eigenvalues  are  clustered.  In  particular,  if  A  has  only  m  distinct  eigenvalues,  then,  in 
exact  arithmetic,  convergence  to  a  solution  requires  only  m  iterations. 

In  preconditioning,  we  attempt  to  replace  the  original  linear  system  by  one  with  a  more 
clustered  eigenvalue  structure.  A  positive-definite  matrix  M  must  be  determined  so  that  when  its 
inverse  is  applied  to  the  original  system  to  produce 

M'1  Hp  =  -  M~lg  .  (8) 

the  relevant  matrix  .V/_1  H  has  many  eigenvalues  close  to  unity.  Although  M~l  H  is  not 
necessarily  symmetric,  a  related  system  can  be  formulated  from  which  the  CG  algorithm  is 
derived. 

The  preconditioning  matrix  M  should  satisfy  three  general  criteria: 

1)      M  should  be  easy  to  compute; 
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2)  The  linear  system  M  z  =  r  should  be  easy  to  solve  relative  to  H  p  =  -g  .  and 

3)  The  eigenvalue  properties  of  .I/-1  H    should  be  significantly  better  than  those  o[  H  for 
CG  implementation.    Preferably,  the  eigenvalues  of  M~    H  should  be  close  to  unity. 

The  first  two  conditions  are  important  because  the  the  usual  CG  procedure  is  implemented 
on  the  modified  system  (8).  In  addition  to  the  usual  matrix/ vector  product,  the  linear  system 
M  z  =  r  must  be  solved  at  each  iteration.  Thus.  M  should  be  simultaneously  simple  and 
powerful.  When  M  =  I  (the  Identity  matrix),  the  system  is  unchanged  and  thus  criterion  3  is 
violated.  Toward  the  other  extreme,  the  closer  M  is  to  H  ,  the  closer  A/-1  H  is  to  the  identity 
matrix.  In  this  case,  though,  criteria  1  and  2  are  violated.  The  choice  for  .1/  must  then  involve 
some  clever  balance. 

A  natural  intermediate  is  to  choose  for  M  some  part  of  H.  This  type  of  preconditioning 
technique  is  called  splitting  and  has  actually  evolved  from  a  traditional  class  of  methods  for 
solving  linear  systems,  known  as  relaxation  methods.  The  idea  is  similar:  instead  of  solving 
H  p  =  -  g,  split  Xt  into  two  component  matrices  as  H  =  M  -  N  .  and  given  some  starting  vector 
Po,  solve  for  p  iteratively  by  the  formula  M  p;.-!  =  .V  pk  -  g.  Thus,  if  a  particular  splitting 
design  produces  a  preconditioner  M  that  satisfies  the  first  two  criteria  above,  the  possible  benefits 
can  be  tested  upon  implementation. 

A  preconditioner  for  potential  energy  functions.  A  possible  preconditioner  is  automatically 
provided  by  a  careful  examination  of  the  Hessian  matrix  associated  with  potential  energy 
functions.  Each  entry  of  such  a  Hessian  is  composed  of  a  sum  of  partial  derivatives  from  the 
various  energy  components.    In  other  words, 

d2E(\)    _   (  d2Eu(x)         d2ECOiL(x)    \ 
dx,  dXj         \     dx,  dxj  dx,  dx 


i  ■"»_/ 


d"^flO.VD<: 
dx,   dXj 


-J*'  d-CaAVG(x)  d-E2F0LD(\)  d-E}F0LD(\)    ) 

^  ^      -  +        ,     , +  — r— +  — - — (9) 

d.v,  dXj  dx,  dXj  dx,  dv;  j 


where      i  ,  j  -  1 n      and      E u  .  EC0LL  ,  EB0SD  .  EBANG  .  E1F0LD  .  and  E3F0LD       are      the 


-13- 

Lennard-Jones,  Coulombic,  Bond  length.  Bond  angle.  Twofold  torsion  and  Threefold  torsion 
potentials,  respectively.  The  contributions  from  EBOsd  ■  Ebaxg  ■  ^ifold  ar)d  EyF0LD  are  a"  local 
in  the  sense  that  their  derivatives  are  zero  with  respect  to  all  but  the  few  variables  that  appear  in 
the  bond  length,  bond  angle  or  torsional  potentials.  If  these  energy  components  alone  are 
considered,  the  Hessian  will  be  sparse.  The  non-bonded  pairwise  interactions  (Eu  and  EC0LL  ) 
are  the  long-range  interactions,  and  especially  the  slowly-decaying  Coulombic  potential  contributes 
significantly  even  at  large  distances.  Hence,  their  effect  is  to  make  the  Hessian  matrix  dense. 

By  taking  the  preconditioning  matrix  to  consist  of  only  the  second  partial  derivatives  with 
respect  to  the  local  or  short-range  interactions,  we  construct  a  sufficiently  simple  and  sparse 
matrix.  This  preconditioner  is  simple  because  it  is  constructed  automatically  as  derivatives  are 
computed  from  the  various  energy  terms.  The  solution  of  linear  systems  for  sparse  matrices  is 
significantly  faster  than  for  dense  matrices,  and  so  the  system  M  z  =  r  is  solved  much  faster  than 
ffp=  -g. 

In  general,  the  sparsity  pattern  of  M  can  be  exploited  in  direct  factorization  methods.  For 
band  matrices,  for  example,  (a  matrix  M  has  bandwidth  p  when  mtJ  =  0  for  |  i  —  j  |  <  /;),  the 
band  structure  is  preserved,  so  that  reduction  in  time  and  storage  is  considerable  when  p  «  n.  If 
M  is  symmetric  and  sparse  but  the  location  of  zeros  differs  from  row  to  row,  profile- 
storage  schemes  are  efficient.  Such  schemes  involve  storing  the  matrix  elements  in  2  vectors:  one 
vector  a  for  the  elements  from  the  first  nonzero  entry  of  each  row  to  the  diagonal:  and  an 
associated  pointer  vector  b  for  recording  the  positions  of  the  diagonal  elements  of  M  as  entered  in 
a.    During  the  factorization  of  M,  only  the  elements  stored  in  a  need  to  be  modified. 

In  our  implementation,  we  use  the  modified  Cholesky  factorization  to  solve  this  linear 
system.  This  linear  solver  is  attractive  in  this  context  for  two  reasons.  First,  it  in  fact  guarantees 
that  the  effective  preconditioner  is  positive-definite.  Second,  application  of  a  direct  solver  is 
efficient  here.  Since  the  system  M  z  =  r  is  solved  repeatedly  within  CG  for  different  vectors  r. 
the  modified  Cholesky  factors  can  be  reused.  Only  backward  and  forward  substitutions  will  be 
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required  to  obtain  z  after  the  first  CG  iteration  of  every  outer  iteration.  For  the  deoxyribose 
application,  a  dense  linear  solver  is  used.  We  will  implement  a  sparse  version  of  the  modified 
Cholesky  factorization  in  future  applications. 

With  all  these  improvements,  our  model  code  of  the  truncated  Newton  method  based  on 
preconditioned  Conjugate  Gradients  for  potential  energy  functions  can  be  written  as  follows: 


INITIALIZE 


*     Set  the  following  parameters  and  functions 
(in  parentheses  we  indicate  our  choices): 
For  outer  loop:  e  (  10      ) 


-£-       lie 


ForCG:     5(10"'").    <t>k  =  min  j-J-  .||g*||  \  .  u.  <  10   -) 
For  line  search:    a  (  10"3)  .  3  (0.9) 

*  Supply  an  initial  guess  x0. 

*  Set  k  =  0  and  compute  £(xn)  and  g0 

OUTER  LOOP:      1.     TEST  FOR  CONVERGENCE:  if  ||  gk  \\  <  e.     stop. 

2.      COMPUTE  Hk  and  Mk  :  Assemble  the  Hessian  Hk  while  storing 
the  "local"  contributions  in  the  preconditioning  matrix  Mk  . 


INNER  LOOP : 

Set  p^    to  the  approximate  solution  ot 

using  the  preconditioned  Conjugate  Gradient  method  | 
( see  algorithm  (A3)). 


4.      LINE  SEARCH      Compute  \k  so  that  [12.  Section  6  3] 


E(\k  +  KkPk)  S  E(xk)  +  a^P* 


and 


5.      EXIT: 


E(xk  +  \tp4i  2  Eixk)  +  (3A.^g[pi- 
( implementation  is  discussed  later). 

Set    xt_]  =  \k  +  V  Vk 

Store  the  new  function  and  gradient  values 
E{\k,])      and      g^..     respectively 

k    -k+\ 
Goto  1 


3.  INNER  LOOP  (  *) 
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A 

INITIALIZE: 

Rename  gk  as  g.    Hk  as  H  and  Mk  as  M 

Set;  =  0.  po  =  0  .  r0  =  -g 

Solve  for  z(l  :            A/z,,  =  r„ 

Set  d0  =  z0 

B 

TEST  FOR  DIRECTION  OF  NEGATIVE  CURVATURE: 

[hd]7  =  Hd, 

It    (  dr[hd],  <  8  ||d7||:  )    then 

If   (y  =  0)    then      p  =  d, 

Else                              p  =  p7 

Endif 

"Goto      D 

Endif 

C 

CONJUGATE  GRADIENT  (ITERATION  j): 

rrz„ 

d>d], 

p,_i  =  p~  +  ad 

i>;  =  r.  -  a  [hd], 

*  =  mm    j  -^  .  ||g||   [ 

If      (   Hr^ili  <  <P  \\g\\  )      then 

P    =    P;-! 

Goto    D 

Endif 

Solve  for  z;_!  :          .V/z7_  j  =  r,_  t 

1  J  Ls 

d,.,    =    2,.!    +    B,d 

;-;+  i 

Goto     B 

D 

EXIT: 

Rename  p  as  p^ 

(A3) 
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3.3     Algorithmic  Properties 

The  truncated  Newton  version  described  here  requires:  1)  the  computation  of  one 
function  and  gradient  evaluation  for  every  outer  iteration  (more  if  needed  for  the  line 
search),  and  2)  one  Hessian/vector  product,  the  solution  of  a  sparse  linear  system,  and  Oin) 
additions  and  multiplications  for  every  inner  CG  iteration.  As  previously  remarked,  the 
Cholesky  factors  of  M  need  to  be  computed  only  once  per  outer  iteration,  after  which  they 
can  be  reused.  The  computational  cost  of  the  modified  Cholesky  factorization  depends  on  the 
sparsity  pattern  of  M.  For  dense  systems,  the  computational  cost  is  O(n'),  but  for  sparse 
systems  it  can  be  as  low  as  O(n). 

Storage  requirements  depend  critically  on  the  Hessian  structure  and  on  the  method  used 
to  compute  the  Hessian/matrix  product.  If  storage  of  0(n~)  is  available,  the  Hessian  and  the 
Hessian/vector  product  can  be  calculated  and  stored  in  the  standard  way.  However,  for 
systems  of  hundreds  or  thousands  of  atoms,  it  is  preferable  to  seek  more  modest  storage 
demands.  For  a  general  sparse  preconditioner  M,  only  the  nonzero  elements  need  to  be 
stored  when  adapting  a  sparse  linear  solver.  For  the  preconditioner  suggested  here,  this 
means  that  only  the  local  terms  (bond  length,  bond  angle  and  dihedral  angle)  need  to  be 
stored  explicitly.  This  idea  can  be  extended  further:  if  the  Hessian/vector  multiplication  can 
also  be  computed  without  the  second-derivative  non-bonded  terms,  then  only  a  fraction  of 
the  Hessian  elements  need  to  be  computed  at  all  !  Besides  considerably  reducing  storage  and 
derivative  computation  time,  this  strategy  (outlined  in  the  next  section)  will  eliminate  entirely 
the  need  for  cut-off  distances. 
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3.4     Algorithmic  Extensions 

In  general,  there  are  two  possibilities  for  calculating  the  Hessian/vector  product  without 
analytic  second-derivatives.  If  the  Hessian  is  sparse,  finite-difference  schemes  can  be 
considered.  The  sparsity  pattern  can  be  exploited  to  obtain  a  finite-difference  Hessian  from 
considerably  fewer  than  n  gradient  evaluations;  once  the  Hessian  is  available,  the 
Hessian/vector  product  can  be  computed  by  including  only  multiplication  with  the  non- 
vanishing  matrix  components. 

In  potential  energy  applications,  however,  the  Hessian  is  not  sparse  when  all  the  non- 
bonded  elements  are  considered.  Although  many  elements  may  be  small  in  magnitude,  their 
cumulative  effect  may  be  significant;  neglecting  these  terms  results  in  an  approximation  which 
is  unnecessary  in  the  context  of  CG.  Since  the  Hessian  matrix  itself  is  not  required,  a  better 
choice  is  to  approximate  the  Hessian/vector  product   by  a  special  finite-differencing  design: 

g(xk  +  hd)  -  g(\k) 

H(xk)d  =  - +  O(h)  .  (10) 

n 

The  finite -difference  interval  h  should  be  small  enough  to  provide  satisfactory  convergence. 
but  large  enough  to  avoid  numerical  instability  which  reduces  the  accuracy  of  the  computed 
product.  Since  g(Xj.)  is  already  available  for  the  inner  iteration,  only  one  additional  gradient 
evaluation  is  required  for  this  calculation.  Thus,  computing  Hd  by  this  formula  replaces 
every  analytic  Hd  calculation  in  the  outer  iteration  by  an  additional  gradient  evaluation  in 
every  inner  iteration. 

In  summary,  we  suggest  the  following  implementation  of  the  truncated  Newton  method 
for  large-scale  potential  energy  functions.  First,  for  the  preconditioner.  compute  the  local 
Hessian  elements  explicitly.  These  contributions,  particularly  our  polynomial  representations, 
are  relatively  easy  to  compute,  as  demonstrated  in  the  accompanying  paper  [30].  Second, 
approximate  the  matrix/vector  product  by  the  finite-difference  formula  above.  With  this 
variation,  the  storage  requirements  of  the  truncated  Newton  method  are  of  O(n),  and  the 
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overall  computational  cost  depends  on  the  number  of  inner  CG  iterations,  the  number  of 
function  and  gradient  evaluations,  and  the  cost  for  solving  the  linear  system  in  CG.  Thus, 
with  such  modifications,  truncated  Newton  methods  can  compete  directly  with  nonlinear  CG 
methods. 

Dembo  and  Steihaug  [6]  have  shown  that  convergence  of  the  truncated  Newton  method 
(using  the  analytic  HA  product)  is  quadratic  when  <t>k  is  chosen  as  in  (6).  The  algorithm  will 
always  terminate  with  satisfaction  of  the  truncated  Newton  criterion  (7)  near  a  strong  local 
minimum  as  long  as  p.  is  sufficiently  small.  In  the  unusual  case  that  the  algorithm  is  started 
from  a  stationary  point  (gk  =  0  but  Hk  is  not  positive-definite),  the  parameters  <£>k  and  8  can 
be  set  to  zero;  then,  except  for  rare  cases,  the  search  will  continue  toward  a  local  minimum. 
The  convergence  characteristics  of  the  variant  with  HA  approximated  by  (  10)  are  expected  to 
be  similar,  provided  the  finite-difference  interval  h  is  sufficiently  small. 

To  examine  all  these  properties  from  a  practical  point  of  view,  we  will  now  investigate 
the  performance  of  the  method  on  the  potential  energy  function  for  a  deoxyribose  model. 
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4.     COMPUTATIONAL  RESULTS 

To  compare  performance  of  various  optimization  algorithms,  one  often  needs  to  test  a 
wide  range  of  problems,  and  to  compare  features  of  computation  time,  number  of  function 
and  gradient  evaluations,  obtainable  accuracy,  storage  requirements,  progress  from  distant 
starting  points,  progress  at  indefinite  regions,  and  so  on.  The  most  effective  algorithm  for  a 
given  problem  is  designed  by  combining  the  best  features  observed  in  practice.  In  general. 
the  superiority  of  Newton-type  methods  over  gradient  methods  and  certainly  over  non- 
derivative  methods  has  been  well-illustrated  and  documented  in  the  literature.  For  difficult 
test  problems  such  as  '  R6senbrock"s  function  [11,  for  example],  Newton  methods  are 
successful  at  moving  rapidly  toward  local  minima,  whereas  gradient  methods  such  as 
nonlinear  CG  and  Steepest  Descent  cycle  indefinitely  or  take  an  excessively  long  time  to 
locate  the  local  minima  regions. 

Within  the  Newton  class  of  methods,  the  efficiency  of  truncated  Newton  methods  for 
large-scale  problems  has  already  been  demonstrated  for  several  problems  [6,  for  example]. 
Although  the  truncated  Newton  has  significantly  outperformed  full  Newton  methods  as  well 
as  the  nonlinear  CG  in  [6],  the  two  examples  were  strictly  convex  functions.  For  potential 
energy  applications,  it  is  essential  to  handle  indefinite  Hessian  regions. 

The  Model  Potential  Energy  Function.  Here,  we  compare  the  performance  of  the 
truncated  Newton  method,  other  Newton  methods,  and  the  nonlinear  CG  on  a  potential 
energy  function  of  a  model  deoxyribose.  As  we  have  described  in  the  accompanying  paper 
[30].  only  the  two  local  minima  corresponding  to  the  C2*— endo  and  C3"  — endo  puckering 
forms  were  obtained  from  a  wide  range  of  starting  points.  Performance  of  the  minimization 
runs  from  different  starting  points  was  very  similar,  and  we  thus  describe  a  set  of 
representative  runs  from  a  starting  conformation  corresponding  to  the  pseudorotation  phase 
angle  P  =  0  Although  the  molecular  model  involves  only  39  degrees  of  freedom,  the 
potential  energy  function  already  contains  the  major  energy  components:   bond  length  and 
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bond  angle  strain  terms,  torsional  potentials,  the  6-12  Lennard-Jones,  and  the  electrostatic 
potential  with  a  distance  dependent  dielectric  function.  Although  the  hydrogen-bond  potential 
is  not  used  in  the  present  calculation,  a  mathematical  form  similar  to  that  of  the  Lennard- 
Jones  potential  is  typically  used.  Our  experience  with  the  truncated  Newton  code  on  test 
problems  suggests  that  function  form,  rather  than  size,  governs  the  overall  performance. 
Therefore,  results  from  this  application  can  already  illustrate  important  features  of  the 
truncated  Newton  method  and  can  guide  us  in  future  algorithmic  development  and 
applications. 

Several  features  of  our  analytic  construction  of  the  potential  energy  function  are  relevant 
to  the  minimization:  1)  The  bond  length,  bond  angle  and  torsional  potentials  are  expressed 
as,  polynomials  of  the  cartesian  variables  in  order  to  facilitate  analytic  differentiation.  2)  A 
penalty  function  is  added  to  fix  the  first  three  atoms  to  the  x  ,  y  plane;  this  avoids  the 
problem  of  an  indefinite  Hessian  due  to  rotation  and  translation  invariance.  3)  The  code  for 
calculating  the  potential  energy  function  and  its  derivatives  is  written  in  a  flexible  and 
modular  form;  each  subprogram  handles  a  segment  of  the  potential  energy  function,  so  that 
assembly  of  the  Hessian  and    the  preconditioner  is  easy  and  efficient. 

Coding  details.  The  truncated  Newton  code  is  written  to  allow  variations  and  to 
facilitate  applications  to  different  potential  energy  functions.  For  example. 
Reverse  Communication  is  implemented  for  the  matrix/vector  product  of  CG.  This  means  that 
at  step  3B  of  CG,  control  is  returned  to  the  main  calling  program  in  order  to  perform  this 
operation.  This  procedure  avoids  imposing  specific  data  structures  within  the  CG  code,  and 
instead  allows  the  user  to  supply  an  external  subroutine  to  compute  the  Hessian/vector 
product.  This  subroutine  is  then  called  from  the  main  program  itself.  Similarly,  Reverse 
Communication  is  implemented  in  step  3C  of  the  CG  code  to  solve  the  linear  system  for  z. 

For  the  line  search,  a  simple  interval  backtracking  procedure  is  used.  When  the  value 
A.  =  1  does  not  satisfy  the  two  line  search  criteria  as  indicated  in  the  algorithm  (A2).  \  is 
scaled  by  0.5  .  In  practice,  in  almost  all  the  line  searches  for  various  test  functions  and  the 
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potential  energy  function,  a  value  of  \  =  1  was  already  satisfactory.  This  supports  the 
observation  that  the  search  direction  pk  generated  by  the  algorithm  is  well-scaled  [8].  To 
require  the  smallest  possible  number  of  function  evaluations  in  the  algorithm,  a  cubic 
interpolation  algorithm  will  be  used  for  the  line  search  in  future  calculations. 

Algorithms  tested.  All  calculations  described  below  were  performed  in  double  precision 
on  a  Vax  8600  computer  (approximately  14  decimal  digits  of  accuracy)  at  the  Courant 
Institute  of  Mathematical  Sciences.  New  York  University.  The  following  algorithms  are 
compared: 

1.  The  truncated  Newton, code  : 

A.  without  preconditioning  (TN-A) 

B.  with  diagonal  preconditioning  :  Mk  =  diag  {  Hk  }  (TN-B) 

C.  with  local  preconditioning  as  described,  call  this  preconditioner  Mk  (TN-C) 

D.  with   Mk  and  the  finite  — difference  approximation  for  HA  (TN-D) 

2.  Newton  methods         (using  analytic  second-derivatives)  : 

A.  the  truncated  Newton  code  TN-C  but  with  <$>k  =   10~10  (N-A) 

B.  Gill  and  Murray's  modified  Newton  code  (N-B) 

3.  Others       (without  analytic  second-derivatives)  : 

A.  nonlinear  Conjugate  Gradient  (CG) 

B.  BFGS,  a  quasi-Newton  method  (BFGS) 

We  compare  within  and  among  these  different  classes  in  turn. 
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4.1    Performance  of  the  Truncated  Newton  Variants 

In  Table  I,  we  compare  performance  of  all  variants  of  the  truncated  Newton  method. 
For  each  outer  iteration,  we  provide  the  gradient  norm  and  number  of  inner  iterations;  we 
then  summarize  the  cumulative  number  of  CG  iterations  for  the  run.  The  number  of  function 
calls  is  tabulated  in  Table  III  for  all  minimization  methods.  For  TN-A  through  TN-D  21 
function  calls  were  made.  The  function  and  gradient  are  computed  in  the  same  subroutine. 
so  the  number  of  gradient  evaluations  is  the  same.  The  maximum  allowed  number  of  CG 
iterations  was  set  to  n  +  10  =  49.  The  parameter  u,  in  the  truncated  Newton  criterion  was 
used  as  10~2  throughout.  We  found  that  the  performance  of  the  truncated  Newton  algorithm 
is  sensitive  to  the  choice  of  |x.  The  value  we  chose  produced  the  minimum  number  of  total 
CG  iterations  for  tested  u,  values  in  the  range  of  1  to  !0--.  ■ 

We  see  from  Table  I  that  the  number  of  outer  iterations  for  all  truncated  Newton 
variants  is  the  same  (17),  but  the  number  of  inner  CG  iterations  is  widely  different.  Clearly, 
without  preconditioning  (TN-A),  round-off  occurs  in  CG  and  many  iterations  are  performed. 
In  this  first  version,  almost  all  CG  loops  terminate  by  exceeding  the  allowable  number  of 
iterations,  and  not  by  satisfying  the  truncated  Newton  criterion.  With  diagonal 
preconditioning  (TN-B),  almost  as  many  CG  iterations  are  performed,  but  the  truncated 
criterion  is  satisfied. 

Tremendous  improvement  is  realized  with  the  preconditioner  XI  k  (TN-C).  The 
truncation  criterion  is  now  satisfied  in  a  fraction  of  the  number  of  CG  iterations  performed  in 
the  previous  two  versions,  typically  9  iterations.  The  cumulative  number  of  inner  loop 
iterations  for  this  version  is  137.  about  1/6  of  the  number  required  for  the  2  variants  above  ! 
This  demonstrates  not  only  the  necessity  and  power  of  preconditioning  in  Conjugate 
Gradient  methods  but  also  the  effectiveness  of  the  present  preconditioning  matrix. 

The  performance  of  the  truncated  Newton  version  with  the  finite-difference 
approximation  to  HA    (TN-D)  is  described  for  h  =  1CT4.  After  testing  h  ranges  of  10~"  to 
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10~h.  this  parameter  seemed  reasonable,  but  a  smaller  value  can  produce  slightly  better 
results.  As  expected,  the  performance  of  this  version-is  similar  to  the  optimal  version  above. 
Thus,  the  finite-difference  approximation  can  be  useful  for  large-scale  problems  to  avoid 
storage  and  computation  of  the  analytic  second-derivatives  of  the  non-bonded  terms. 


. 
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4.2    Performance  of  Truncated  Newton  vs.  Newton 

To  examine  the  effect  of  the  inexact  Newton  direction  on  the  overall  rate  of  convergence 
to  a  local  minimum,  we  compare  the  truncated  Newton  code  (TN-C)  with  2  Newton  methods 
in  Table  II.  The  first  Newton  method  (N-A)  is  the  same  truncated  Newton  code  with  one 
Fortran  statement  changed  -  the  truncation  criterion  becomes  very  strict  by  setting 
<$>k  -  10~10.  To  compare  with  a  Newton  method  that  solves  for  p^.  directly  rather  than 
iteratively,  we  tested  Gill  and  Murray's  modified  Newton  method  (N-B)  available  in  the 
NAG  Fortran  library  [27].  Testing  the  first  Newton  formulation  is  important  since  it  can  be 
directly  compared  to  the  truncated  Newton  version  TN-C.  Since  the  work  for  every  outer  and 
inner  iteration  is  identical  and  the  number  of  outer  iterations  is  the  same  in  both  methods,  the 
total  number  of  CG  iterations  can  be  directly  compared. 

As  we  see  in  Table  II,  about  14-15  inner  iterations  solve  for  pk  exactly  in  the  full 
Newton  method.  This  produces  nearly  twice  the  total  number  of  inner  iterations  as  the 
truncated  Newton  method!  Thus,  using  the  truncated  approach  results  in  a  significant 
reduction  in  computational  time  for  the  linear  system  A/z  =  r.  the  matrix,  vector 
multiplication,  and  the  other  CG  operations.  Clearly,  relaxing  the  required  accuracy  of  the 
Newton  direction  far  away  from  a  local  minima  does  not  sacrifice  the  overall  convergence  of 
the  method. 

Gill  and  Murray's  Newton  method  [18]  is  based  on  the  modified  Cholesky  factorization. 
Instead  of  solving  Hkpk  =  —  gk  by  preconditioned  CG,  a  related  positive-definite  matrix  Hk  is 
constructed  so  that 

H~k  =  Hk  +  Dk  (11) 

where  Dk  is  a  diagonal  matrix  with  non-negative  elements,  and  p^.  is  obtained  from  the 
modified  system  Hkpk  =  -  g*..  The  indefiniteness  of  Hk  is  detected  during  the  Cholesky 
factorization  itself,  and  the  diaaonal  elements  of  the  oriainal  matrix  are  increased  to  ensure 
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sufficient  positive-definiteness.    The  resultant  form  is  given  by 

H~k  =  rkD~krkT .  d2) 

where  Lk  is  lower  triangular  and  Dk  a  diagonal  matrix.  Once  this  factorization  is  available, 
\>k  is  determined  by  solving  the  two  simple  subproblems  of  backward  and  forward 
substitution,  each  in  closed  form: 

£*y  =  -  g*  and'       DkLk  T  pk=  y  .  (13) 

This  approach  requires  O(n')  operations  tor  every  Newton  iteration  and  O(n')  ot  storage.  By 
comparison,  the  truncated  Newton  approach  in  which  p;.  is  found  iteratively  requires  a 
computation  time  proportional  to  (the  number  of  CG  iterations)  x  (operations  per  CG 
iteration)  and  storage,  depending  on  the  version  adapted.  O(n)  to  0{n~).  Since  the 
computation  time  for  every  CG  iteration  is  proportional  to  (O(n)  +  operations  for  solving 
Mx  =  r  ),  a  combination  of  a  powerful  preconditioner  and  a  sparse  linear  solver  should 
produce  a  significant  advantage  in  computation  and  storage  for  large-scale  applications  of 
Newton  methods  that  implement  these  iterative  schemes  for  p^.. 

As  we  see  in  Tables  II  and  III.  both  our  Newton  code  and  NAG's  Newton  code 
converged  to  the  specified  gradient  tolerance  in  17  iterations  and  required  a  similar  number 
of  function  and  gradient  evaluations:  21  or  22.  Although  "charged  CPU  time"  is  hardly  an 
adequate  criterion  of  comparison,  especially  since  the  printed  output  for  our  truncated 
Newton- code  is  much  more  extensive,  the  time  for  both  these  minimization  runs  was  similar 
(7.0  minutes).  Thus,  the  truncated  Newton  version  based  on  solving  the  Newton  equation 
iteratively  is  already  competitive  with  the  Newton  method  based  on  the  direct  Cholesky 
factorization.  It  is  interesting  to  note  that  this  observation  is  in  direct  agreement  with 
O'Leary's  related  conclusions  that  discrete  Newton  methods  based  on  CG  are  superior  to  a 
discrete  Newton  method  based  on  the  modified  Cholesky  factorization  when  n  >  39  [8], 
Nonetheless,  an  advantage  of  Gill  and  Murray's  modified  Newton  method  is  the  fact  that  a 
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preconditioning  strategy  is  not  required  to  produce  a  reliable  scheme.  Unfortunately,  for 
large-scale  problems,  this  direct  approach  is  not  feasible  for  dense  Hessians.  Even  if  it  were 
feasible,  the  effort  would  not  be  justifiable  in  physically  unimportant  regions  of  the  multi- 
dimensional conformation  space. 


- 
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4.3   Performance  of  Methods  based  on  Analytic  First-Derivatives 

To  obtain  a  perspective  of  the  performance  of  all  these  Newton  codes  against  methods 
that  have  been  used  in  many  potential  energy  investigations  and  that  do  not  use  analytic 
second-derivatives,  we  apply  the  two  nonlinear  optimization  codes  in  Shanno's  and  Phua'a 
package  CONMIN  [28].  The  program  contains  two  state-of-the-art  codes  for  the  nonlinear 
CG  method  and  the  variable  metric  (or  quasi-Newton)  method  BFGS.  The  CG  method  is 
based  on  Beale's  CG  with  optimal  scaling  and  restarts  and  can  be  interpreted  as  a 
memoryless  variable-metric  method  [15].  The  BFGS  method  is  a  rank-2  update  with 
hereditary  positive-definiteness  which  incorporates  initial  Hessian  scaling  [29].  For  both 
algorithms,  the  convergence  criterion  is 

||  gk  ||  <  e  max  {  1  ,  \\xk\\  }  ,  (14) 

and  the  line  search  criteria  are 

E(xk  +  kkpk)  <  E(xk)  +  0.0001    kkg[pk  (15) 

and 

|pJg*+i/Ptgfc  I  <  0.9  .  (16) 

The  second  line  search  criterion  above,  an  alternative  to  our  second  criterion,  requires 
sufficient  rate  of  decrease  of  the  objective  function  along  pt.  For  the  CG  code,  at  least  two 
trial  X.'s  are  tested,  whereas  for  BFGS,  if  the  value  X  =  1  satisfies  the  two  criteria,  it  is 
automatically  accepted.  Storage  requirements  are  0{n)  double-precisioned  words  for  CG  and 
Odr)  tor  BFGS;  computational  requirements  per  iteration  are  O(n)  for  CG  and  0(n~)  for 
BFGS.  Thus,  to  compare  these  methods  to  each  other  and  to  the  previous  methods,  we  need 
to  compare  1)  the  number  of  function  and  gradient  evaluations  required  to  reach  a  certain 
accuracy,  and  2)  the  total  number  of  iterations  performed  in  CG  and  BFGS  to  the  total 
number  of  inner  iterations  in  the  truncated  Newton  codes. 
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In  Table  III  we  compare  the  results  of  the  above  two  algorithms  with  the  results  of  all 
previously  tested  algorithms.  Since  we  limited  the  number  of  gradient  and  function 
evaluations  to  1000.  and  the  CG  code  achieved  a  gradient  norm  of  1.3  x  10_l  by  that 
restriction,  we  can  compare  the  work  required  in  all  methods  to  reach  such  an  accuracy. 

The  CG  code  performed  492  iterations  and  involved  1001  function  calls,  while  the  BFGS 
code  required  215  iterations  and  221  function  calls  to  obtain  the  same  accuracy.  Thus,  the 
quasi-Newton  method  outperforms  CG  as  expected:  unfortunately,  in  general,  memory 
considerations  mandate  the  use  of  CG  for  large  problems. 

In  comparing  CG  and  BFGS  with  all  other  methods,  we  find  the  following.  First,  the 
accuracy  obtained  in  all  other  Newton  codes  is  better.  Since  convergence  in  the  last  steps  of 
Newton  methods  is  quadratic,  it  is  much  easier  to  achieve  a  small  gradient  norm  rapidly. 
Thus,  in  comparing  the  work  for  all  methods  to  satisfy  a  convergence  criterion  of  1.3  x 
10_f>  ,  we  are  actually  over-approximating  the  work  for  the  TN  and  N  codes.  Second,  the 
cumulative  number  of  inner  iterations  for  the  truncated  Newton  codes  with  the  preconditioner 
Mk  is  less  than  the  number  of  iterations  for  both  CG  and  BFGS:  137  for  TN-C  and  148  for 
TN-D.  For  the  Newton  code  N-A.  the  total  number  of  iterations  is  239.  slightly  more  than 
for  BFGS  and  about  1/2  that  of  CG.  Third,  and  most  significant,  the  number  of  function  calls 
is  considerably  less  for  all  Newton  versions  ,  about  1/10  the  number  for  BFGS  and  1  50  the 
number  for  CG  !  Thus,  Newton  methods  are  far  superior,  as  the  dominant  cost  factor  in 
large-scale  optimization  is  the  number  of  function  and  gradient  evaluations.  With 
adaptations  to  limited  storage  and  analytic  second-derivative  manipulation,  the  truncated 
Newton  class  is  the  most  efficient  and  powerful  of  the  various  methods  considered  in  this 
paper. 
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4.    CONCLUSIONS 

In  this  paper,  we  have  attempted  to  encourage  the  Newton  approach  and  the  use  of 
analytic  second-derivative  information  tor  potential  energy  minimizations.  We  have 
demonstrated  that  the  inhibiting  storage  and  computational  requirements  of  Newton  methods 
can  be  made  manageable  in  the  truncated  Newton  approach,  while  the  method  still  retains  the 
exceptional  local  quadratic  convergence  rate  of  classic  Newton  methods. 

The  truncated  Newton  algorithm  presented  here  combines  several  attractive 
features:  1)  inexact  search  directions  far  away  from  local  minima,  2)  iterative  solution  of 
the  Newton  equation  by  the  linear  Conjugate  Gradient  method,  and  3)  preconditioning  the 
Newton  equation  by  a  physically  meaningful  part  of  the  Hessian.  With  the  finite-difference 
approximation  to  the  Hessian/vector  product  and  a  sparse  direct  linear  solver  for  the 
preconditioning  matrix  M,  the  storage  and  computational  demands  are  competitive  with  those 
of  the  nonlinear  Conjugate  Gradient,  but  performance  is  more  predictable  and  efficient. 

The  small  molecular  system  examined  here  provides  a  perspective  on  the  performance  of 
various  minimization  methods  and  permits  an  investigation  of  algorithmic  variations  and 
parameters.  In  particular,  full  Newton  methods  based  on  the  Cholesky  factorization  can  still 
be  implemented  on  a  Va.x  computer.  Since  we  believe  that  performance  of  the  truncated 
Newton  method  is  governed  more  by  function  form  rather  than  problem  size,  application  of 
efficiently-implemented  truncated  Newton  codes  is  promising  for  potential  energy 
minimizations. 

For  very  large  molecular  systems,  several  issues  deserve  careful  consideration.  First,  the 
potential  energy  function  should  be  formulated  so  that  differentiation  is  efficient  (i.e.  fast  and 
accurate).  This  can  be  accomplished,  for  example,  by  a  direct  formulation  of  the  energy 
function  in  terms  of  the  independent  variables.  Since  the  Lennard-Jones  and  electrostatic 
terms  are  most  naturally  expressed  in  terms  of  the  interatomic  distances,  cartesian 
coordinates  are  preferred  as  conformational  variables.  Althoush  internal  coordinates,  such  as 


-  31  - 

dihedral  angles  or  constraints  on  the  accessible  conformational  space  (e.g.  the  pseudorotation 
pathway),  are  tempting  to  incorporate  (since  they  reduce  the  number  of  degrees  of  freedom). 
such  formulations  are  cumbersome;  they  require  coordinate  construction  at  every  iteration 
and  repeated  application  of  the  chain  rule.  In  the  present  formulation,  for  example,  a  direct 
and  simple  energy  function  was  constructed  by  allowing  the  full  degrees  of  freedom  for  the 
sugar  ring  and  by  expressing  bond  length,  bond  angle  and  dihedral  angle  terms  as 
polynomials  of  the  cartesian  variables. 

Second,  atomic  numbering  schemes  are  crucial.  A  particular  numbering  scheme  should 
be  devised  so  that  the  Hessian  part  from  the  local  energy  terms  will  display  a  particular 
sparsity  pattern,  perhaps  a  banded  structure.  For  chain  polyruiclefitides  such  as  nucleic  acids 
this  will  certainly  be  possible.  The  specific  sparsity  pattern  should  then  be  incorporated  into 
the  linear  solver  and  perhaps  into  the  finite-difference  approximation  to  HA  to  improve 
computational  efficiency. 

Third,  various  parameters  and  segments  of  the  truncated  Newton  code  can  be  modified 
to  suit  a  particular  problem  structure.  For  different  potential  energy  applications,  alternative 
forcing  sequences  <&k  and  preconditioning  strategies  should  be  examined.  An  alternative 
preconditioning  strategy  might  consist  of  the  local  Hessian  elements  and  those  short-range 
non-bonded  terms  whose  contribution  is  thought  to  be  significant  to  the  conformational 
energy.  Alternative  choices  for  the  search  directions  in  case  of  negative  curvature  can  be 
considered  as  well.  When  j  >  0  at  step  3B  of  the  algorithm  (A3),  the  exiting  direction  could 
be  :  i)  —  g  ,  the  steepest  descent  direction:  ii)  d0  ,  a  direction  interpolating  between  the 
steepest  descent  direction  and  the  Newton  direction:  or  Hi)  dy  .  a  guaranteed  direction  of 
negative  curvature  [6].  Such  strategies  may  be  important  for  cases  when  little  progress  is 
realized  at  indefinite  Hessian  regions. 

Fourth,  efficient  implementation  of  the  truncated  Newton  code  on  supercomputer 
architecture  can  further  extend  their  range  of  applicability  for  potential  energy  functions. 
Vectorization  can  be  obtained  in  the  following  areas:  nested  loops  for  function  and  derivative 
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computation,  basic  linear  algebra  routines,  and  parallel  treatment  of  repeating  molecular 
subunits.  We  are  currently  examining  these  modeling  and  minimization  issues  for  larger 
DNA  sequences. 
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TABLE  LEGENDS 

Table  I.  Comparison  of  minimization  runs  for  a  deoxyribose  model  of  39  variables.  The  starting 
conformation  corresponds  to  the  pseudorotation  phase  angle  P  =  0°  and  (relative)  energy  of  282.26 
kcal/mol.  The  energy  is  high  from  the  contribution  of  the  energy  penalty  term  for  translation  and 
rotation  invariance.  TN-A  is  the  truncated  Newton  code  without  preconditioning,  TN-B  has  diagonal 
preconditioning,  TN-C  has  the  "local"  preconditioning  as  described  in  the  text,  and  TN-D  is  the  version 
with  the  finite-difference  approximation  to  the  Hessian/vector  product  in  Conjugate  Gradient.  The 
obtained  local  minimum  corresponds  to  the  C3'-endo  puckering  form,  with  11.87  kcal/mol. 

The  convergence  criterion  for  all  methods  is  a  gradient  norm  less  than  1.0  E  -  7  ,  and  the  finite- 
difference  interval  h  in  TN-D  (see  equation  (10))  is  1.0  E  -  4.  For  each  method,  the  gradient  norm 
and  number  of  inner  loop  iterations  is  given  for  each  outer  Newton  iteration.  The  cumulative  number 
of  CG  iterations  is  given  at  the  bottom.  As  can  be  seen  from  the  gradient  norm  progress  in  these 
versions,  the  performance  is  very  similar  in  all  cases,  but  the  different  preconditioning  strategies  affect 
significantly  the  number  of  required  CG  iterations.  With  the  "local"  preconditioner  of  TN-C,  the  search 
proceeds  much  more  efficiently  than  versions  TN-A  and  TN-B,  where  round-off  error  occurs.  The 
performance  of  the  finite-difference  version  TN-D  is  almost  identical  to  that  in  TN-C. 


Table  II.  Comparison  of  minimization  runs  for  the  same  molecular  model  as  described  in  Table  I. 
The  truncated  Newton  version  TN-C  is  compared  to  N-A,  a  Newton  method  obtained  with  the 
truncated  Newton  code  with  the  forcing  sequence  <t>-  *  1.0  E  -10,  and  N-B,  Gill  and  Murray's 
modified  Newton  based  on  the  Cholesky  factorization.  Clearly,  the  performance  of  TN-C  and  N-A  is 
similar  except  that  the  truncated  Newton  code  proceeds  more  efficiently  toward  the  minimum,  with  less 
CG  iterations.  These  results  indicate  that  even  a  Newton  method  based  on  an  accurate  solution  to  the 
Newton  equation  can  be  fast  when  a  good  preconditioner  is  used.  The  progress  in  Gill  and  Murray's 
method  N-B  is  similar  to  that  of  the  methods  above:  17  Newton  iterations  are  required  for  convergence, 
and  22  function  and  gradient  evaluations  in  comparison  to  21  for  N-A  and  TN-C.  This  illustrates  that 
for  this  small  problem,  performance  of  a  Newton  method  based  on  an  iterative  solution  of  the  Newton 
equation  is  already  competitive  with  a  Newton  method  that  implements  a  direct  solver.  Thus, 
superiority  of  the  iterative  approach  is  expected  for  larger  problems. 


Table  III.  Comparison  of  all  minimization  runs  for  the  same  molecular  model  as  described 
above.  TN-A  through  TN-B  are  the  4  truncated  Newton  versions,  N-A  and  N-B  the  2  Newton  versions, 
CG  is  Beale's  nonlinear  Conjugate  Gradient  with  optimal  scaling  and  restarts  and  BFGS  is  the  quasi- 
Newton  method  based  on  a  rank-2  update  with  hereditary  positive-definiteness.  The  last  two  methods 
are  implemented  by  Shanno's  and  Phua's  program  CONMIN  [28].  For  the  truncated  Newton  codes,  the 
number  of  outer  and  inner  iterations  are  given,  and  for  the  others,  the  number  of  iterations.  All 
algorithms  are  compared  for  a  convergence  criterion  of  gradient  norm  less  than  1.3  E  -6,  which  is  the 
value  obtained  for  the  CG  algorithm  after  the  maximum  allowed  function  calls  -  1000  -  were  made. 
We  see  that  such  a  convergence  criterion  produces  a  lower  gradient  norm  for  all  TN  and  N  codes. 
Thus,  the  work,  required  for  these  methods  is  actually  an  over-approximation  in  comparison  to  the  work 
required  for  BFGS  and  CG.  The  results  illustrate  the  superiority  of  the  truncated  Newton  codes  and  the 
Newton  codes  based  on  analytic  second-derivatives  over  BFGS  and  CG  :  for  all  Newton  methods,  the 
number  of  function  calls,  the  dominant  cost  factor  in  large-scale  problems,  is  a  small  fraction  of  that 
required  in  BFGS  and  CG. 


Table  I 
Comparison  of  the  Truncated  Newton  Variants 


TN-A 


TN-B 


TN-C 


TN-D 


ITR 


1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 

12 

13 

14 

15 

16 

17 

18 


5.52E  +  2 
4.99E  +  2 
1.62E+2 
5.85E  +  2 
2.52E  +  1 
1.41  E  +  1 
7.22E  +  1 
1.09E  +  1 
2.91E  +  0 
1.40E  +  1 
4.80E-  1 
2.24E  +0 
9.70E-  1 
3.30E-  1 
2.95E-2 
3.92E-4 
1.34E-5 
9.06E-8 


#CG 
Itera- 
tions 


49 
37 
49 
49 
49 
49 
49 
49 
49 
49 
49 
49 
49 
49 
49 
49 
49 


821 


llgll 


5.52E  +  2 
6.7TE  +  2 
1.65E  +  2 
5.81E  +  1 
3.52E  +  1 
1.51E  +  1 
1.22E  +  1 
5.98E+0 
1.36E  +  1 
2.1  IE  +  0 
6.85E  +  0 
2.52E  +  0 
4.69E  +  0 
5.20E-  1 
1.02E  +  0 
1.62E-2 
1.95E-3 
5.37E-8 


#CG 
Itera- 
tions 


48 
35 
39 
44 
47 
47 
48 
49 
48 
49 
48 
48 
48 
49 
48 
49 
49 


793 


i  ig 


5.52E+2 
6.71E  +  2 
1.65E  +  2 
5.81E  +  1 
3.52E  +  1 
1.51E  +  1 
1.22E  +  1 
5.98E  +  1 
1.36E  +  1 
2.12E+0 
6.83E  +  0 
2.54E  +  0 
4.67E+0 
5.30E-  1 
1.02E  +  0 
1.67E-2 
2.00E-3 
5.63E-8 


#CG 
Itera- 
tions 


llgi  I 


#CG 
Itera- 
tions 


12 
5 
6 
7 
8 
9 
8 
9 
8 
9 
8 
9 
9 

10 
9 

10 
9 


137 


5.52E  +  2 
6.70E  +  2 
1.65E  +  2 
5.81E  +  1 
3.52E  +  1 
1 .51 E  +  1 
1.22E  +  1 
5.97E+0 
1.36E  +  1 
2.1  IE  +  0 
6.88E+0 
2.50E+0 
4.73E+0 
5.09E-  1 
1.03E  +  0 
1.57E-2 
1.89E  -3 
4.62E-8    : 


14 
5 
6 
7 
8 
9 
8 
9 
8 

10 
8 
9 
9 

10 
9 

10 
9 


148 


Table  II 
Truncated  Newton  vs.  Newton 


TN-C 


N-A 


N-B 


ITR 

llgll 

#CG 
Itera- 

llgl I 

#CG 
Itera- 

Ilgi I 

tions 

tions 

i 

1 

5.52E  +  2 

12 

5.52E  +  2 

18 

5  52E  +  2 

2 

6.71E  +  2 

5 

6.71E  +  2 

11  ' 

5.29E  +  2 

3 

1.65E  +  2 

6 

1.65E  +  2 

13 

2.64E  +  2 

4 

5.81E  +  1 

7 

5.80E  +  1 

13 

5.88E  +  1 

5 

3.52E  +  1 

8 

3.52E  +  1 

14 

4.75E  +  1 

6 

1.51E  +  1 

9 

1.51E  +  1 

14 

1.07E  +  1 

7 

1.22E  +  1 

8 

1.22E  +  1 

14 

1.92E  +  1 

8 

5.98E+0 

9 

5.98E  +  0 

15 

2.04E+  1 

9 

1.36E+  1 

8 

1.36E  +  1 

14 

1.00E  +  1 

10 

2.12E  +  0 

9 

2.12E  +  0 

15 

2.97E  +0 

11 

6.83E+0 

8 

6.85E  +  0 

14 

5.98E  +0 

12 

2.54E  +0 

9 

2.52E+0 

15 

2.18E  +0 

13 

4.67E+0 

9 

4.69E+0 

14 

3.40E  +  0 

14 

5.30E-1 

10 

5.20E-  1 

15 

4.03E-  1 

15 

1.02E+0 

9 

1.02E+0 

15 

4.43E-  1 

16 

1.67E-2 

10 

1.63E-2 

14 

5.94E-3 

17 

2.00E-3 

9 

1.96E-3 

12 

1.36E-4 

18 

5.63E-8 

— 

5.41E-8 

— 

5.20E-10 

137 

239 

Table  III 
Comparison  of  all  Minimization  Methods 


Method 

T"  '"  '  '  ' U-J LJJJ 

Total  Iterations 
Outer/Inner 

Total  E  and  g 
Evaluations 

Final 

llgll 

TN-A 

17/821 

21 

9.06E-8 

TN-B 

17/793 

21 

5.37E-8 

TN-C 

17/137 

21 

5.63E-8 

TN-D 

17/156 

21 

4  62E  -8 

N-A 

17/239 

21 

5.41E-8 

N-B 

17 

22 

5.20E-10 

CG 

491 

1001 

1  30E  -6 

BFGS 

215 

221 

1  27E  -6 
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SUPPLEMENTARY  NOTES 


-    References  in  the  text  preceeded  by  the  letter  N    are  provided  in  this  section    - 


[1]  Global  convergence  refers  to  the  convergence  of  a  method  to  a  local  minimum  from  an  arbitrary 
starting  point  x0.  Convergence  to  the  global  minimum  of  a  function  is  a  separate  problem  entirely. 

[2]  The  reader  may  be  confused  between  the  terms  linear  CG  and  nonlinear  CG.  The  linear  CG  is  an 
iterative  method  for  solving  a  linear  system  A  x  =  b  where  A  is  a  positive-definite  (symmetric) 
matrix.  The  nonlinear  CG  is  a  minimization  method  by  its  own  right  which  uses  only  function  and 
gradient  values  of  the  objective  function.  The  two  methods  are  actually  identical  when  the 
nonlinear  function  to  be  minimized  is  quadratic  and  convex. 

[3]      A  symmetric  matrix  A  is  said  to  be  positive -definite  if  the  quadratic  form  u  Au  >  0  for  all  nonzero 

vectors  u.    Similarly,  A  is  negative -definite  if  urAu  <  0  for  all  nonzero  vector  u.    A  is  indefinite  if 

urAu  is  positive  for  some  u  and  negative  for  others. 

d~E(x\  d~E(x) 

In  particular,  the  Hessian  matrix  is  symmetric  (i.e.    H  ,(x)  =  H ,(x))  since  — ^  =   -*-.    It 

'  dxt  dxt  dx.  dx, 

is   not   positive-definite    in   general,    but   it   is   positive-definite   in   the    neighborhood   of   a   local 

minimum  of  E . 

[4]  A  descent  direction  p  is  a  vector  that  satisfies  g^p  <  0.  This  condition  ensures  that  the  objective 
function  is  decreased  along  the  direction  p.  To  see  why,  consider  the  Taylor  expansion  of  E  about 
xk  along  p  scaled  by  \: 

E(xk  +  kp)  =  E(xt)  +  \g.  T  p  +   j  \2  p  T  H(x,   -   \6p)  p  (1) 

where  0  <  p  <  1  and  0  <  \  <  1.  Given  a  descent  direction  p,  there  exists  some  positive  scalar  \ 
such  that  for  all  \  <  \,  the  linear  term  in  X  dominates  in  magnitude  over  the  quadratic  term.  That 
is, 

kg*  T  P  -   y  *:  P  T  H(xk  ~  \8p)  p  <  0  ,  (2) 

or  equivalently, 

E(x,-\p)  <  E(x.)  (3) 

as  desired. 

[5]  The  determination  of  an  appropriate  choice  for  the  step  length  along  the  search  direction  is  quite 
a  complicated  business,  and  is  the  central  question  of  practical  line  search  algorithms.  Intuitively  it 
makes  sense  to  require  that  the  function  value  is  sufficiently  decreased,  i.e. 

E(x,+  |)  <  £(x,)  ,  (4) 
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but  this  simple  condition  does  not  guarantee  that  the  sequence  of  x  iterates  will  converge  to  the 
minimizer  of  E.  There  are  various  conditions  that  can  be  added  to  ensure  convergence.  The 
simplest  one  is  a  requirement  that  the  rate  of  decrease  of  E  in  the  direction  p.  at  the  new  iterate 
x.._  |  will  be  sufficiently  smaller  than  the  rate  of  decrease  of  E  in  the  direction  p,  at  \k.  For 
details,  see  for  example  [28],  page  116,  or  [11],  page  100. 

[6]  Derivatives  may  be  approximated,  rather  then  calculated,  when  the  objective  function  is 
continuously-differentiable.  Such  a  function  can  be  expanded  in  a  Taylor  series.  For  a  univariate 
function  f(x)  the  Taylor  series  can  be  arranged  as 


/(*-*)-/(*)    =/'(x)+0(A),  (5) 

h 


where  the  quantity  h  is  known  as  the  finite-difference  interval,  and  the  expression  0{h)  refers  to 

the  magnitude  of  the  error  made  in  truncating  the  Taylor  series,  an  error  proportional  to  h  when  h 

is  small. 

■ 
Second  derivatives  can  be  approximated  bv  considering  the  two  Tavlor  expansions 

/(x  +  A)  =  f(x)  +■  hf  '(*)  *  ~/  "(*)  -  ^-f  '"(*)  *  0(A4)  (6) 

z  o 


f(x-h)  =  f(x)  -  hf  '(*)  +   £f  "(x)  -   £f  "'(x)  -  0(A4)  (7) 

Z  O 


Forming  the  linear  combination  of  these  terms  that  eliminates  the  first  derivative  terms,  we  get 


/(*  +  »)  -2/(*)  */(*-*)    =  f  "W  +  Q(^)  .  (8) 

h- 


The  left  hand  side  of  this  equation  gives  an  approximation  to/    (x),  accurate  to  order  hl . 
Extension  to  multivariate  functions  is  as  follows.    For  a  rwice-continuously  differentiable  function 
F(x,,x:,  .  .  .  ,x„),  we  associate  n  finite  difference  intervals  h..  The  approximation  to  ^(x)  along 
the  basis  vector  e.  is  given  by 

F(x  +  h}tj)  =  F(x)  -  /rg(x)re,  -   I*?  ej H{x+kjtj)  e}  +0(|/i;|3)  .  (9) 

The   term   g(x)re;    is  just  the  j-th  element  of    the  gradient  g(x),   and   eT  H(x-  h.e.)  e  ,    the  ;-th 
diagonal  element  of  H(x~h/e  ).  We  then  obtain  a  formula  for  the  y-th  gradient  component: 

g;(x)  =    i-  [  F(x+hjtj)-F(x)  ]  -  O(h)  ,  (10) 
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analogous  to  (5). 

A  first  order  approximation  to  the  ij  element  of  the  Hessian  H(x)  is  given  by: 

HJx)  =   —  [F(x~h.e,~h,en  -  F(x  +  «.e,)  -  F(x-he)  +  F(x)\    -  0(min(«,,fl  ))  .  (11) 

hjhl 

This  formula  can  be  derived  by  replacing  both  gradient  values  in  the  approximation  to  the  ;-th 
column  of  H(x) 

-Mg(x-/re;)-g(x)],  (12) 

ft: 

J 

by  first-order  differences  in  F(x). 

Approximating  the  gradient  by  finite-differences  requires  order  of  n  function  evaluations,  and  the 
Hessian  order  of  n1  operations.  Careful  implementation  of  the  numerical  difference  scheme  is 
required  (for  example,  in  determining  the  proper  difference  intervals)  in  order  to  minimize  the 
sum  of  the  truncation  and  rouding  errors  that  are  introduced.  The  approximations  above  are 
generally  first-order  accurate. 

[7]  The  convergence  properties  of  an  algorithm  are  described  by  two  analytic  quantities:  convergence 
order  and  convergence  ratio.  Let  the  sequence  x0  ,  x,  ,  x, ",  .  .  .  converge  to  x  ,  i.e. 
lim  k_x  \\xk  -  x"||  =0.  The  sequence  is  said  to  converge  to  x~  with  order  p  if  p  is  the  largest  non- 
negative  number  for  which  the  limit  3  exists  and  is  finite,  where 

0  -  p  =  lim  t_x„      '  ~.X,      .  (13) 


When  p  =  1  and  (3<  1  the  sequence  is  said  to  converge  linearly  (e.g.  x<  =  2  *  ,  n=  1),  when  p  =  1 
and  3  =  0  the  sequence  converges  superlinearly  (e.g.  xi  =  i"<  ,/i  =  l),  and  when  p=2  the 
convergence  is  quadratic  (e.g.  xk=2'2  ,  n=l).  The  constant  3  is  the  associated  convergence 
ratio. 

[8|  Density  of  a  matrix  is  a  measurement  given  by  the  ratio  of  the  number  of  nonzero  to  the  number 
of  zero  matrix  components.  A  matrix  is  said  to  be  dense  when  this  ratio  is  large  and  sparse  when 
it  is  small. 
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